#This tells python to draw the graphs "inline" - in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
from scipy.stats import norm
import statistics
import seaborn as sns
import pylab
import pandas as pd
import numpy as np
# make the plots (graphs) a little wider by default
pylab.rcParams['figure.figsize'] = (10., 8.)
sns.set(font_scale=1.5)
sns.set_style("white")
Hypothesis Testing
Workshop 6 
For the rest of this course, we’ll be working with data from the U.S. Census Current Population Survey (CPS).
Aims:
- Understanding:
  - Confidence Intervals
  - Hypothesis Testing
    - Stating the Null and Alternative Hypotheses
    - Selecting a Critical Value
    - Calculating the Test Statistic
    - Making a Decision
Getting Started
We will be following on from the analysis we conducted in Workshop 5 (Distributions and Basic Statistics). We visually explored differences in the income levels between different groups of people in the US census. Now, we are going to conduct hypothesis testing to see if those differences are statistically significant.
df=pd.read_csv('https://storage.googleapis.com/qm2/wk7/cps.csv')
df['race']=df['race'].astype('category')
df['income']=df['incwage']/1000
df=df[df['income']<200]
df=df[df['year']==2013] # filter the dataframe to only contain 2013 data
df.head()
 | year | state | age | sex | race | sch | ind | union | incwage | realhrwage | occupation | income
---|---|---|---|---|---|---|---|---|---|---|---|---
20 | 2013 | 50 | 62 | 1 | 1 | 14.0 | 8090 | 1.0 | 57000.0 | 23.889143 | . | 57.0 |
32 | 2013 | 39 | 59 | 1 | 1 | 13.0 | 9590 | 0.0 | 62000.0 | 29.726475 | Consruction, Extraction, Installation | 62.0 |
34 | 2013 | 44 | 44 | 1 | 3 | 12.0 | 7290 | 0.0 | 45000.0 | 20.745834 | . | 45.0 |
36 | 2013 | 12 | 41 | 1 | 1 | 12.0 | 7070 | 1.0 | 28000.0 | 12.293828 | Managers | 28.0 |
37 | 2013 | 33 | 35 | 1 | 1 | 12.0 | 770 | 0.0 | 42500.0 | 20.377020 | Transportation and materials moving | 42.5 |
This is once again the U.S. census data from Week 5. As a reminder, our dataframe contains the following columns:
- year: Survey year
- age: the person’s age
- sex: the person’s sex
  - 1 = male
  - 2 = female
- race: the person’s race
  - White non-Hispanic = 1
  - Black non-Hispanic = 2
  - Hispanic = 3
  - Other non-Hispanic = 4
- sch: Educational attainment
  - None = 0
  - Grades 1-12 = 1-12
  - Some University = 13
  - Associate’s degree = 14
  - BA = 16
  - Advanced Degree = 18
- union: Union membership
  - N/A = 0
  - No union coverage = 1
  - Member of labor union = 2
  - Covered by union but not a member = 3
- incwage: Wage and salary income
- realhrwage: Real hourly wage
- occupation: Occupation
- ind: Industry code
Confidence Intervals
So far in this workshop, we’ve had the luxury of being able to draw many random samples and plot the distributions of their sample means to infer the population mean. The Central Limit Theorem lets us assume that these sample means are normally distributed, and consequently that there is a 95.45% chance that the population mean lies within two standard errors of the sample mean. This allows us to make inferences on the basis of a single sample. The standard error is calculated from the sample standard deviation as follows:
Sample Standard Deviation
\[\huge s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (x_i - \overline{x})^2}\]
Standard Error
\[\huge se = \frac{s}{\sqrt{n}}\]
Given a large enough sample \(x\), we can build a 95% confidence interval as follows:
\[ \huge 95\% CI = \overline{x} \pm (1.96 \times se)\]
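To make the formula concrete, here is a worked example with hypothetical round numbers (not taken from our data): if \(\overline{x} = 50\), \(s = 30\), and \(n = 900\), then

\[ se = \frac{30}{\sqrt{900}} = 1, \qquad 95\% CI = 50 \pm (1.96 \times 1) = [48.04,\ 51.96]\]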
Let’s draw a sample of 1000 random individuals from our data, and compute a 95% confidence interval to estimate the population mean for income. We’ll begin by creating a swarmplot to get a sense of how the data are distributed.
sample=df.sample(1000) # draw a random sample of 1000 observations from the dataframe
sns.swarmplot(data= sample, y='income') # plot a swarmplot of income
plt.title('Income Distribution') # add a title
Now let’s set about calculating the 95% confidence interval and plotting it on our swarmplot. Luckily, the components we need to do this are easy to calculate: we just need the mean, standard deviation, and number of observations. All of these are provided by the .describe() function, which calculates summary statistics for a sample.
desc=sample['income'].describe() # calculate descriptive statistics for the sample
print(desc) # print the descriptive statistics
count 1000.000000
mean 47.677568
std 33.041244
min 0.700000
25% 25.000000
50% 40.000000
75% 62.000000
max 190.000000
Name: income, dtype: float64
From the set of descriptive statistics, we can pull out the relevant components, calculate the standard error, and create a 95% confidence interval as follows:
mean=desc['mean'] # set the mean equal to a variable called "mean"
std=desc['std'] # set the standard deviation equal to a variable called "std"
n=desc['count'] # set the sample size equal to a variable called "n"
se=std/np.sqrt(n) # calculate the standard error of the mean

print('The mean is', round(mean, 2), 'with a standard error of', round(se, 2)) # print the mean and standard error

upper_ci = mean+se*1.96 # calculate the upper confidence interval
lower_ci = mean-se*1.96 # calculate the lower confidence interval

print('The 95% confidence interval is', round(lower_ci, 2), 'to', round(upper_ci, 2)) # print the confidence interval
The mean is 47.68 with a standard error of 1.04
The 95% confidence interval is 45.63 to 49.73
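As an optional cross-check (this isn’t part of the original workshop code), the same interval can be computed from scipy’s normal distribution, reusing the mean and se variables defined above:

ci_low, ci_high = norm.interval(0.95, loc=mean, scale=se) # 95% interval of a normal distribution centred on the sample mean
print('scipy 95% CI:', round(ci_low, 2), 'to', round(ci_high, 2)) # should match the manual calculation above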
Finally, let’s plot these bounds on our swarmplot to graphically show this range. We can now claim that based on our sample, there is a 95% chance that the true population mean of income (shown in red) lies within this range.
sns.swarmplot(data= sample, y='income',alpha=0.5) # plot a swarmplot of income
plt.axhline(df['income'].mean(), color='red', linestyle='solid', linewidth=3, label='Population Mean') # add a horizontal line at the mean
plt.axhline(upper_ci, color='black', linestyle='dashed', linewidth=3) # add a dashed black line at the upper confidence interval
plt.axhline(lower_ci, color='black', linestyle='dashed', linewidth=3) # add a dashed black line at the lower confidence interval

plt.title('Income Distribution, 95% Confidence Interval') # add a title
Hypothesis Testing
If we create a boxplot of income disaggregated by sex using our sample, we can observe that men seem to earn more than women:
sns.boxplot(data=sample , x='sex', y='income').set_xticklabels(['Men','Women']) # make a box plot of income by sex
But is this difference statistically significant? It could just be due to sampling error (random chance). Hypothesis testing provides a framework through which we can formally evaluate the likelihood of encountering an effect as extreme as the one we observe in our data (in this case, the difference between the mean incomes of the two groups). There are four main steps in hypothesis testing:
- State the hypotheses. H0 states that there is no difference between the two population means.
- Select an \(\alpha\) level (e.g. 95% confidence), and select a corresponding critical value (1.96 for large samples)
- Compute the test statistic.
- Make a decision; if the test statistic exceeds the critical value, we reject the null hypothesis.
Steps 1, 2, and 4 remain fairly constant regardless of what kind of hypothesis testing you’re conducting. Step 3 can vary quite a bit, as there are many different statistical tests that fall under the umbrella of hypothesis testing. In today’s workshop we’ll be using the Student’s T-Test (more on that in a second). For now, let’s begin the process of hypothesis testing.
1. State the hypotheses
The Null Hypothesis
- \(H_0\) : There is no difference in the mean income between men and women
- \(H_0\) : \(\mu_{men} = \mu_{women}\)
The Alternative Hypothesis
- \(H_a\) : There is a difference in the mean income between men and women
- \(H_a\) : \(\mu_{men} \neq \mu_{women}\)
2. Select an \(\alpha\) level
Locate the critical region; the critical values for the t statistic are obtained using degrees of freedom (\(df=n-2\)). Given that we have 1000 observations, \(df=998\). For large samples like this (df in the hundreds or more), you can simply memorize the following critical values:
- At the 95% confidence level, the critical value is 1.96
- At the 99% confidence level, the critical value is 2.58
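If you’d rather not memorize them, these critical values can also be looked up directly. The short sketch below is an optional aside (it uses scipy’s t distribution, which isn’t otherwise needed in this workshop) showing the two-tailed critical values for \(df=998\):

from scipy import stats # only needed for this aside; scipy is already imported above for the normal distribution
print(round(stats.t.ppf(0.975, df=998), 3)) # two-tailed critical value at 95% confidence (close to 1.96)
print(round(stats.t.ppf(0.995, df=998), 3)) # two-tailed critical value at 99% confidence (close to 2.58)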
If our test statistic exceeds either of these values, we can reject the null hypothesis with the corresponding level of confidence. The function below creates a plot which provides a visual reference for these values, but isn’t strictly necessary for the process of hypothesis testing. It accepts one argument, test_statistic, which it uses to plot a vertical red line. If the red line falls within the dotted lines, we fail to reject the null hypothesis at the corresponding confidence level. If it falls outside of these bounds, we reject the null hypothesis.

In the last line of code below, I’ve called the function to plot a test statistic of -2.3. Would we reject the null hypothesis at the 95% confidence level? What about at the 99% level?
def plot_z(test_statistic):
    mu, se = 0, 1 # create two variables, a mean "mu" equal to zero, and standard deviation "se" equal to 1
    x = np.linspace(mu - 3*se, mu + 3*se, 100) # create a range of values from -3 to 3 standard deviations

    # plot the normal distribution
    plt.plot(x, norm.pdf(x, mu, se))

    plt.axvline(mu-se*1.96, color='blue', linestyle='dashed', linewidth=1.5, label='µ ± 1.96σ (95% confidence)') # plot a vertical line at the mean minus 1.96 standard deviations
    plt.axvline(mu+se*1.96, color='blue', linestyle='dashed', linewidth=1.5) # plot a vertical line at the mean plus 1.96 standard deviations
    plt.axvline(mu-se*2.58, color='green', linestyle='dashed', linewidth=1.5, label='µ ± 2.58σ (99% confidence)') # plot a vertical line at the mean minus 2.58 standard deviations
    plt.axvline(mu+se*2.58, color='green', linestyle='dashed', linewidth=1.5) # plot a vertical line at the mean plus 2.58 standard deviations

    plt.axvline(test_statistic, color='red', linestyle='solid', linewidth=1.5, label='Test Statistic') # plot a vertical line at the test statistic

    plt.ylim(0, 0.4)
    plt.legend()
    plt.title('Z Distribution') # add a title
    plt.show()

plot_z(-2.3)
3. Calculate the Test Statistic (The Student’s T-Test)
The Student’s T-Test is used with an independent-measures design; that is, in situations where a researcher has no prior knowledge about either of the two populations (or treatments) being compared. In particular, the population means and standard deviations are all unknown. Because the population variances are not known, these values must be estimated from the sample data.
The purpose of a T-test is to determine whether the sample mean difference indicates a real mean difference between the two populations or whether the obtained difference is simply the result of sampling error. Given two groups, \(x_1\) and \(x_2\), the \(t\) statistic is calculated as:
\[ \Huge t = {\frac{\overline{x_1}-\overline{x_2}} {\sqrt{\frac{s^2_1}{n_1} + \frac{s^2_2}{n_2}}}} \]
Where:
- \(\overline{x}\): Sample Mean
- \(s^2\): Sample Variance (the squared sample standard deviation)
- \(n\): Number of observations
We’ve already seen how to calculate each of these components when we made the 95% confidence interval above using the .describe() function. To calculate the t-statistic, we just have to plug these values into the formula above and do some basic arithmetic. I’ve put together a function that does this below, which accepts two main arguments, group1 and group2. For each group it calculates descriptive statistics, and uses these values to calculate the t-statistic. It also has an optional argument plot, which when set to True will plot a 95% confidence interval for each group. It defaults to False, meaning that it won’t generate the plot.
def manual_ttest(group1, group2, plot=False): # define a function called "manual_ttest" that takes two groups and a boolean value for whether or not to plot the results as arguments
    desc1, desc2 = group1.describe(), group2.describe() # get descriptive statistics for both samples

    n1, std1, mean1 = desc1['count'], desc1['std'], desc1['mean'] # get the sample size, standard deviation, and mean of the first sample
    n2, std2, mean2 = desc2['count'], desc2['std'], desc2['mean'] # get the sample size, standard deviation, and mean of the second sample

    # calculate standard errors
    se1, se2 = std1**2/n1, std2**2/n2 # '**2' is the same as squaring the number

    # standard error on the difference between the samples
    sed = np.sqrt(se1 + se2)

    # calculate the t statistic
    t_stat = (mean1 - mean2) / sed

    # print the results
    print("Group 1: n=%.0f, mean=%.3f, std=%.3f" % (n1, mean1, std1))
    print("Group 2: n=%.0f, mean=%.3f, std=%.3f" % (n2, mean2, std2))
    print('The t-statistic is %.3f' % t_stat) # print the t-statistic

    if plot==True: # if the plot argument is set to True, plot the results
        groups=pd.DataFrame() # create an empty dataframe
        i=1 # create a counter variable called "i" and set it equal to 1

        for group in [group1, group2]: # loop through each group in the list of groups
            plot_df=pd.DataFrame({'Values': group, 'Group': i}) # create a dataframe with the values of the group and a column called "Group" that contains the group number
            groups=pd.concat([groups, plot_df]) # append the group's dataframe to the combined dataframe (pd.concat replaces the deprecated DataFrame.append)
            i+=1 # increase the counter by 1

        sns.pointplot(data=groups, x='Group', y='Values', errorbar=('ci', 95), color='black', join=False, capsize=.8) # plot the means of the groups with a 95% confidence interval
        plt.title('Comparison of Group Means with 95% Confidence Intervals') # add a title

    return t_stat # return the t-statistic
Having defined the function, we can now call it to run a t-test for the difference in income between men and women:
men=sample[sample['sex']==1] # filter the sample to only include men
women=sample[sample['sex']==2] # filter the sample to only include women

t = manual_ttest(men['income'],women['income']) # run the t-test function and store the t-statistic in a variable called "t"
Group 1: n=485, mean=57.435, std=36.610
Group 2: n=515, mean=38.489, std=26.179
The t-statistic is 9.364
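As an optional sanity check (again, not part of the original workshop code), scipy’s built-in independent-samples t-test uses the same unpooled-variance formula when equal_var=False, and it also reports a p-value:

from scipy.stats import ttest_ind # scipy's independent-samples t-test
t_check, p_value = ttest_ind(men['income'], women['income'], equal_var=False) # equal_var=False matches the unpooled-variance formula used above (Welch's t-test)
print('scipy t-statistic: %.3f, p-value: %.3g' % (t_check, p_value)) # the t-statistic should be close to the manual result above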
4. Make a Decision
If the t statistic indicates that the obtained difference between sample means (numerator) is substantially greater than the difference expected by chance (denominator), we reject H0 and conclude that there is a real mean difference between the two populations or treatments. Let’s plot the T-statistic from our test against the critical values:
plot_z(t) # plot the test statistic on the z distribution
Based on the plot above, can we reject the null hypothesis that there is no difference in mean income between men and women?
Exercise
- From the main dataframe df, draw a sample of 500 white men. Using t-tests, investigate whether there are statistically significant discrepancies in pay between white men and other groups (note: it would be best to sample 500 people from each of those groups as well). Between which groups does the most significant pay gap exist?
- Some of this variation may be due to occupation. Compare income disparities between men and women within different occupations. Which occupation has the largest pay gap? Which has the smallest?
- Research suggests that within occupational groups, collective bargaining through union membership reduces pay gaps. Read the abstract of this article, and try to replicate the analysis using our dataset.
Assessed Question
When Elon Musk bought Twitter, he promised to restore “free speech” to the platform. He heralded this new era with a tweet on 28/10/2022, which read “the bird is freed”. A tidal wave of hate speech ensued instead.
Using Twitter’s API, I downloaded tweets containing a racial slur. Using the groupby function and regex, I counted the number of mentions of this word per hour on the platform for about a month before the takeover, and a few days thereafter. I’ve saved these counts (but not the tweets themselves) as a csv file called “elon_twitter.csv”.
The code below downloads this csv file, and plots the number of slur-containing tweets over time.
import datetime
from matplotlib.pyplot import figure
import matplotlib.dates as mdates
tweets=pd.read_csv('https://storage.googleapis.com/qm2/wk7/elon_twitter.csv') # read in the data
figure(figsize=(10, 4), dpi=200)
tweets['hour']=pd.to_datetime(tweets['hour'])

tweet=datetime.datetime(2022, 10, 28)

pre_mean=tweets[tweets['hour']<tweet]['count'].mean()
post_mean=tweets[tweets['hour']>tweet]['count'].mean()
pct_change= int(((post_mean-pre_mean)/pre_mean)*100)

plt.ylabel('Hourly Slur Tweets')

plt.plot_date(tweets['hour'], tweets['count'], 'b', color='red')
plt.axvline(tweet, color='black', linestyle='dashed', label='"the bird is freed"')

plt.legend()
plt.title('Elon Musk tweets "The bird is freed"; Tweets containing racial slurs increase {}%'.format(pct_change))

plt.gca().xaxis.set_major_locator(mdates.DayLocator())
plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%d/%m'))
This plot definitely shows an uptick in the number of tweets containing a racial slur following Musk’s tweet. But is this increase statistically significant?
Question: Using a t-test and the full hypothesis testing procedure, investigate whether there was a statistically significant increase in hate speech following Elon Musk’s tweet. Make note of the t-statistic and the p-value.